
Refactor Runners, introduce Task class #4206

Merged: 26 commits merged from runners into main on Nov 1, 2024
Conversation

@merelcht (Member) commented Oct 3, 2024

Description

Introduced the Task class, which encapsulates what is actually run in each of the runners, making the runner code more readable.

I've always found the code in the runners a bit hard to navigate. The (simplified) flow for running a node before my refactor was:

```mermaid
graph TD
    A[run/run_node in runner.py] --> B[_run in sequential_runner.py]
    A --> C[_run in thread_runner.py]
    A --> D[_run in parallel_runner.py]
    B --> A
    C --> A
    D --> A
    A --> F[run in node.py]
```

Now it's:

```mermaid
graph TD
    A[run in runner.py] --> B[_run in sequential_runner.py]
    A --> C[_run in thread_runner.py]
    A --> D[_run in parallel_runner.py]
    B --> E[execute in task.py]
    C --> E
    D --> E
    E --> F[run in node.py]
```
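
To make the new flow concrete, here is a hedged sketch (not the actual implementation; the function name and the Task constructor arguments are assumptions for illustration) of how a runner's _run can now wrap each node in a Task and call execute():

```python
# Hedged sketch of the new flow, assuming a Task constructor that takes the
# node plus the objects needed to run it; argument names are illustrative.
from kedro.runner.task import Task


def run_nodes_sequentially(nodes, catalog, hook_manager, session_id=None, is_async=False):
    """Simplified stand-in for a runner's _run loop."""
    for node in nodes:
        task = Task(
            node=node,
            catalog=catalog,
            hook_manager=hook_manager,
            is_async=is_async,
            session_id=session_id,
        )
        task.execute()  # Task.execute() is what ultimately calls node.run()
```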

Development notes

  • Created Task, which contains everything needed to execute a Node (a minimal sketch follows this list).
  • Created _release_datasets() to remove duplicated code across the runners.
  • Moved all helper methods related to running a node from runner.py to the Task class.
  • Moved the helper methods in parallel_runner.py to task.py.
  • Marked run_node() as deprecated, because it's replaced by Task.execute() and is no longer called directly anywhere other than in tests.
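
For illustration only, a minimal sketch of what such a Task could look like; the attribute names and the execute() body are assumptions (hooks and async loading are omitted for brevity), not the exact implementation:

```python
# Minimal illustrative sketch of the Task abstraction; not the real class.
class Task:
    def __init__(self, node, catalog, hook_manager, is_async=False, session_id=None):
        # Everything needed to execute a single Node is captured here.
        self.node = node
        self.catalog = catalog
        self.hook_manager = hook_manager
        self.is_async = is_async
        self.session_id = session_id

    def execute(self):
        # Load the node's inputs, run it, and save its outputs.
        # Hook calls and async loading are left out of this sketch.
        inputs = {name: self.catalog.load(name) for name in self.node.inputs}
        outputs = self.node.run(inputs)
        for name, data in outputs.items():
            self.catalog.save(name, data)
        return self.node
```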

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
# Conflicts:
#	kedro/runner/runner.py
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
```
@@ -75,21 +75,22 @@ def _run(
        for exec_index, node in enumerate(nodes):
            try:
                run_node(node, catalog, hook_manager, self._is_async, session_id)
                from kedro.runner.task import Task
```
@merelcht (Member Author) commented on the diff:

This is needed because I refactored run_node in the runner to use Task and moved methods to Task as well. run_node isn't actually needed anymore, but removing it would be a breaking change. I could undo the changes to runner.py, which would remove the import of Task there and allow it to be imported inside the runner implementations again. The downside is that we'd have duplicated code in runner.py and task.py.
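
As a rough illustration of what that looks like (a hedged sketch; the exact warning text and signature are assumptions), the deprecated run_node can simply delegate to the new class:

```python
# Hedged sketch: run_node kept as a thin, deprecated wrapper around Task.
import warnings

from kedro.runner.task import Task


def run_node(node, catalog, hook_manager, is_async=False, session_id=None):
    warnings.warn(
        "'run_node' is deprecated; use 'Task.execute()' instead.",
        DeprecationWarning,
    )
    # Delegate to Task, which now holds the actual node-running logic.
    return Task(
        node=node,
        catalog=catalog,
        hook_manager=hook_manager,
        is_async=is_async,
        session_id=session_id,
    ).execute()
```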

merelcht and others added 4 commits October 8, 2024 16:13
Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
@merelcht requested a review from idanov October 9, 2024 12:36
@deepyaman (Member) commented:

Is there a relevant issue for this? Nothing against the idea of introducing the "task" abstraction; I'm just interested in better understanding what motivates it.

@merelcht (Member Author) replied:

> Is there a relevant issue for this? Nothing against the idea of introducing the "task" abstraction; I'm just interested in better understanding what motivates it.

I'll update the description when the PR is ready for review.

merelcht and others added 4 commits October 15, 2024 12:40
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
@merelcht marked this pull request as ready for review October 15, 2024 13:31
@merelcht self-assigned this Oct 15, 2024
@merelcht changed the title from "[WIP] Refactor Runners, introduce Task" to "Refactor Runners, introduce Task class" Oct 15, 2024
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
@merelcht linked an issue Oct 15, 2024 that may be closed by this pull request
merelcht and others added 2 commits October 15, 2024 15:10
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
@noklam self-requested a review October 16, 2024 13:36
@noklam (Contributor) commented Oct 16, 2024

In general, not much concern: the refactoring looks simple enough and is a better abstraction than the previous one. I would like to run the benchmark once #4210 is ready to make sure we don't run into memory/performance issues.

Side note: I think the current Task is a bit weird with node/hook manager/catalog, but I understand it's necessary for keeping this non-breaking, so it's all good; maybe something to revise when we are closer to 0.20.0.

merelcht and others added 3 commits October 17, 2024 14:16
…ional argument, and adding parallel as boolean flag

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
@ElenaKhaustova (Contributor) left a comment:
Looks much cleaner now, thank you @merelcht!

Left one minor suggestion.

Suggestion on kedro/runner/runner.py (resolved).
@noklam (Contributor) left a comment:

The PR looks good. I still want to wait for #4210 to benchmark the runner, if it's not too urgent.

Going through the runner code again, I have some thoughts (not specific to this refactoring).

Q1: With modern Python features like asyncio, the first question I have is: are we really doing async in Kedro? I see that most of the async references are associated with threading, but I don't think they are the same thing. This may be something we could consider; in general, async is simpler and was designed exactly for I/O tasks. Things may get a bit tricky since Python 3.13 started offering a free-threaded (GIL-free) build.

Q2: is_async in SequentialRunner only has a limited scope, i.e. if a node has multiple datasets, the order of loading doesn't matter and they can be loaded in an async manner. (Why isn't it the default already?)

Q3: There is another level of async, at the multi-node level, where each node is executed asynchronously; again, async may be a simpler solution.
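
To make Q2 more concrete, here is a small standalone asyncio sketch (not Kedro code; load_dataset is a hypothetical I/O-bound loader) showing how a node's inputs could be loaded concurrently when their order doesn't matter:

```python
import asyncio


async def load_dataset(name):
    # Hypothetical I/O-bound loader, standing in for reading a file or calling an API.
    await asyncio.sleep(0.1)
    return f"data for {name}"


async def load_inputs(names):
    # Start all loads at once; completion order doesn't matter, which is the
    # situation Q2 describes for a node with multiple input datasets.
    results = await asyncio.gather(*(load_dataset(n) for n in names))
    return dict(zip(names, results))


if __name__ == "__main__":
    inputs = asyncio.run(load_inputs(["companies", "shuttles", "reviews"]))
    print(inputs)
```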

@merelcht (Member Author) replied:

> The PR looks good. I still want to wait for #4210 to benchmark the runner, if it's not too urgent.

Yes of course! We can definitely wait for that.

> Going through the runner code again, I have some thoughts (not specific to this refactoring).
>
> Q1: With modern Python features like asyncio, the first question I have is: are we really doing async in Kedro? I see that most of the async references are associated with threading, but I don't think they are the same thing. This may be something we could consider; in general, async is simpler and was designed exactly for I/O tasks. Things may get a bit tricky since Python 3.13 started offering a free-threaded (GIL-free) build.
>
> Q2: is_async in SequentialRunner only has a limited scope, i.e. if a node has multiple datasets, the order of loading doesn't matter and they can be loaded in an async manner. (Why isn't it the default already?)
>
> Q3: There is another level of async, at the multi-node level, where each node is executed asynchronously; again, async may be a simpler solution.

These are excellent points, thanks @noklam! I haven't had time to continue the refactoring, but I will definitely take this on as part of it.

@merelcht (Member Author) commented Oct 30, 2024

@noklam I don't know if I did this correctly, but I ran `asv run` (against main) and then `asv show main` and got:

benchmark_runner.RunnerMemorySuite.mem_runners [M-HLJY4F7K07/virtualenv-py3.11-kedro-datasets[pandas]]
  1/3 failed
  ================== ========
        runner
  ------------------ --------
   SequentialRunner     0
     ThreadRunner     failed
    ParallelRunner      0
  ================== ========
  started: 2024-10-30 15:53:21, duration: 16.2s

benchmark_runner.RunnerMemorySuite.peakmem_runners [M-HLJY4F7K07/virtualenv-py3.11-kedro-datasets[pandas]]
  1/3 failed
  ================== ========
        runner
  ------------------ --------
   SequentialRunner    102M
     ThreadRunner     failed
    ParallelRunner    98.1M
  ================== ========
  started: 2024-10-30 15:53:37, duration: 16.1s

benchmark_runner.RunnerTimeSuite.time_compute_bound_runner [M-HLJY4F7K07/virtualenv-py3.11-kedro-datasets[pandas]]
  1/3 failed
  ================== ============
        runner
  ------------------ ------------
   SequentialRunner   7.34±0.01s
     ThreadRunner       failed
    ParallelRunner    2.03±0.3s
  ================== ============
  started: 2024-10-30 15:53:53, duration: 1.11m

benchmark_runner.RunnerTimeSuite.time_io_bound_runner [M-HLJY4F7K07/virtualenv-py3.11-kedro-datasets[pandas]]
  1/3 failed
  ================== ============
        runner
  ------------------ ------------
   SequentialRunner   20.2±0.01s
     ThreadRunner       failed
    ParallelRunner    3.23±0.01s
  ================== ============
  started: 2024-10-30 15:54:25, duration: 1.90m

Then I did `asv run main..runners -b runner --step 2` and `asv show runners` and got:

benchmark_runner.RunnerMemorySuite.mem_runners [M-HLJY4F7K07/virtualenv-py3.11-kedro-datasets[pandas]]
  1/3 failed
  ================== ========
        runner
  ------------------ --------
   SequentialRunner     0
     ThreadRunner     failed
    ParallelRunner      0
  ================== ========
  started: 2024-10-30 16:14:54, duration: 1.91m

benchmark_runner.RunnerMemorySuite.peakmem_runners [M-HLJY4F7K07/virtualenv-py3.11-kedro-datasets[pandas]]
  1/3 failed
  ================== ========
        runner
  ------------------ --------
   SequentialRunner    102M
     ThreadRunner     failed
    ParallelRunner    98.3M
  ================== ========
  started: 2024-10-30 16:16:49, duration: 15.9s

benchmark_runner.RunnerTimeSuite.time_compute_bound_runner [M-HLJY4F7K07/virtualenv-py3.11-kedro-datasets[pandas]]
  1/3 failed
  ================== ============
        runner
  ------------------ ------------
   SequentialRunner   7.47±0.07s
     ThreadRunner       failed
    ParallelRunner    2.06±0.01s
  ================== ============
  started: 2024-10-30 16:17:05, duration: 1.07m

benchmark_runner.RunnerTimeSuite.time_io_bound_runner [M-HLJY4F7K07/virtualenv-py3.11-kedro-datasets[pandas]]
  1/3 failed
  ================== ============
        runner
  ------------------ ------------
   SequentialRunner   20.2±0.01s
     ThreadRunner       failed
    ParallelRunner    3.23±0.01s
  ================== ============
  started: 2024-10-30 16:17:36, duration: 1.90m

So overall it looks like there isn't really a difference in performance.

From the pipeline test: https://github.com/kedro-org/kedro/actions/runs/11594561241/job/32281088073?pr=4206 you can also see there's hardly a difference.

noklam and others added 3 commits October 31, 2024 12:36
Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com>
@merelcht requested a review from noklam October 31, 2024 16:56
@merelcht merged commit 18bde07 into main Nov 1, 2024
34 checks passed
@merelcht deleted the runners branch November 1, 2024 10:48
Development

Successfully merging this pull request may close these issues: Refactor Runners
5 participants